DOCUMENT RESUME

ED 300 398                                            TM 012 373

AUTHOR       Hambleton, Ronald K.; Rogers, H. Jane
TITLE        Detecting Biased Test Items: Comparison of the IRT
             Area and Mantel-Haenszel Methods.
PUB DATE     Apr 88
NOTE         37p.; Paper presented at the Annual Meeting of the
             American Educational Research Association (New
             Orleans, LA, April 5-9, 1988).
PUB TYPE     Speeches/Conference Papers (150) -- Reports -
             Evaluative/Feasibility (142)
EDRS PRICE   MF01/PC02 Plus Postage.
DESCRIPTORS  Comparative Analysis; High Schools; High School
             Students; *Item Analysis; *Latent Trait Theory;
             *Statistical Bias; *Test Bias; Test Construction;
             Test Items; White Students
IDENTIFIERS  *Item Characteristic Curve Area; *Mantel Haenszel
             Procedure; Native Americans; New Mexico High School
             Proficiency Examination



ABSTRACT 

The agreement between item response theory-based and
Mantel-Haenszel (MH) methods in identifying biased items on tests was
studied. Data came from item responses of four spaced samples of
1,000 examinees each: two samples of 1,000 Anglo-American and two
samples of 1,000 Native American students taking the New Mexico High
School Proficiency Examination in 1982. In addition, a matched group
analysis was conducted using a third sample of 650 Native Americans
and 650 Anglo-Americans. The item characteristic curve area and the
MH methods were used. The consistency of classification of items into
biased and not-biased was in the 75 to 80% range for both methods.
When the unreliability of item bias statistics was taken into
account, both methods gave similar results. Discrepancies between
methods were due to bias from intersections of item characteristic
curves and the choice of interval over which item bias was defined.
The Mantel-Haenszel method, with a minor modification or two,
provides an acceptable approximation to the item response theory
based methods. Five data tables and eight graphs show study results.
(Author/SLD)






Detecting Biased Test Items: Comparison of the IRT Area and
Mantel-Haenszel Methods^1,2

Ronald K. Hambleton and H. Jane Rogers 
University of Massachusetts at Amherst 



Abstract 



IRT-based methods for identifying biased test items have
considerable appeal; however, difficult problems sometimes arise in
applying them. The Mantel-Haenszel method (MH) shares some of the
desirable features of IRT-based item bias methods but not most of the
difficulties. The main purpose of this study was to determine the
degree of agreement between the IRT-based and MH methods in
identifying biased items and, when the two methods led to different
results, to identify possible reasons for the discrepancies.

Data for the study came from the item responses of Anglo-American
and Native American students who were administered the 1982
New Mexico High School Proficiency Exam. Two samples of 1000
students from each group were used in the item bias analyses. Item
bias methods studied were the ICC Area method (using 3-parameter
ICCs) and the Mantel-Haenszel method.

The main findings were that (1) the consistency of
classifications of items into biased and not-biased categories across
replications was in the 75 to 80% range for both methods, and (2)
when the unreliability of item bias statistics was taken into
account, the two methods led to very similar results. Discrepancies
between methods were due to bias resulting from intersecting ICCs
(the Mantel-Haenszel method could not identify these items) and the
choice of interval over which item bias was defined (the IRT method
results depended on the choice of interval). The implications of the
results for practitioners seem clear: The Mantel-Haenszel method,
with a minor modification or two, provides an acceptable
approximation to the IRT-based methods.






5/30/88 




Recent attention to the detection of biased test items has
resulted in the generation of a plethora of item bias methods (see,
for example, Berk, 1982). Perhaps of greatest interest at present
are IRT-based methods and the Mantel-Haenszel method. IRT-based
methods have become popular and, indeed, are considered "theoretically
preferred" by some researchers (e.g., Shepard, Camilli, & Averill,
1981; Shepard, Camilli, & Williams, 1984) because of their close
connection to the most widely accepted definition of item bias. This
definition states that an item is biased if examinees of the same
ability but from different sub-groups do not have the same
probability of a correct response to the item. Thus the study of item
bias within an IRT framework is a matter of comparing the item
characteristic curves (ICCs) for the two sub-groups of interest
(Hambleton & Swaminathan, 1985). Choice of mathematical form of the
ICCs and approach to representing the differences between ICCs give
rise to many of the IRT-based methods.

The Mantel-Haenszel method, proposed by Holland and Thayer
(1986, 1988), also compares the probabilities of a correct response
in the two groups of interest for examinees of the same ability,
although its calculation is very different from the IRT-based
methods (Holland & Thayer, 1988).

^1 A paper presented at the annual meeting of AERA, New Orleans,
1988.

^2 Laboratory of Psychometric and Evaluative Research Report No.
175. Amherst, MA: School of Education, University of Massachusetts,
1988.

While IRT-based methods may have theoretical appeal, they have
several drawbacks in practice, particularly when the three-parameter
IRT model is used. High costs associated with running an IRT computer
program such as LOGIST, large sample requirements, and sometimes poor
parameter estimates make implementation of IRT item bias methods
problematic, if not impossible, in some situations. Moreover, careful
attention must be given to the scaling of item parameters, the choice
of ability interval over which bias is measured, and (sometimes) the
determination of a "cutoff" value for interpreting the results. The
Mantel-Haenszel method, on the other hand, shares some of the
desirable features of the IRT methods but not most of the
difficulties. Computer programs for calculating the statistic are
easily written; the cost of the analysis is low (probably under ten
dollars); sample sizes need not be as large as for IRT-based methods;
and a significance test is available to aid in interpreting the bias
statistics. This simplicity is achieved at the expense of some
generality, however; the calculation of a Mantel-Haenszel statistic
may be considered analogous to comparing two item characteristic
curves based on a one-parameter logistic model. Thus the
Mantel-Haenszel statistic is not designed to detect non-uniform item
bias.




Two different arguments can be used to support an interest in
the Mantel-Haenszel method. For IRT advocates, the Mantel-Haenszel
method may seem an acceptable approximation. For others, the
Mantel-Haenszel method may seem preferable because of the logic
underlying the method, its conceptual simplicity, and the
availability of suitable significance tests. In view of the wide
acceptance of IRT-based methods and the current interest in the
Mantel-Haenszel method, a comparison of the results when applied to
the same test data seemed timely. A previous study by Hambleton,
Rogers, and Arrasmith (1988) provided some initial findings of high
agreement between the Mantel-Haenszel method and one of the IRT-based
methods (the ICC Area method) when methodological problems associated
with the methods were taken into account. The desirability of
repeating their study with other datasets was noted by the authors.

Purposes 

The main purpose of this research was to carry out a detailed
analysis of the item bias results obtained from an IRT-based item
bias method and the Mantel-Haenszel method. Specifically, interest
was centered on the degree of agreement between the methods in
identifying biased items, and on possible reasons for disagreements
when they were found. The research was primarily intended to
determine the consequences of replacing an IRT-based item bias
method with the easier-to-use and more convenient Mantel-Haenszel
method.




A second purpose of the study was to examine the behavior of
the item bias statistics when the ability distributions of the two
groups of interest are considerably different. Widely differing
ability distributions can be expected to affect the quality of IRT
parameter estimation in one or both groups, and hence will influence
the values of the IRT-based item bias statistics. The effect of
discrepant score distributions on the Mantel-Haenszel statistic is
more difficult to predict; hence it was of interest to study the
situation.

Method 

Description of the Test Data and Examinee Samples 

The samples used in the study were drawn from a dataset
containing the responses of approximately 23,000 students to the 1982
New Mexico High School Proficiency Exam (NMHSPE). The NMHSPE is a
150-item test which assesses "life skills" in five major areas:
Knowledge of Community Resources, Consumer Economics, Government and
Law, Mental and Physical Health, and Occupational Knowledge. Of the
total group of students, approximately 8,000 were Anglo-American and
2,600 were Native American. This dataset was chosen for the study
because of the widely discrepant score distributions of the Anglo-
and Native Americans, and because of the large number of items
flagged as potentially biased in an earlier item bias investigation
(Hambleton, Martois, & Williams, 1983).


Description of the Item Bias Statistics 

ICC Area Method. The ICC Area method entails the calculation of 
the area between the item characteristic curves obtained for each 
group separately (Rudner, Getson, & Knight, 1980). The area is cal- 
culated over a specified ability interval, which in this study was 
from the lower group mean minus three standard deviations to the 
upper group mean plus three standard deviations. Because there is no 
known sampling distribution for the area statistic under the null 
hypothesis of no group differences, items are typically ranked 
according to the values of the statistic and those with the highest 
values flagged as potentially biased. In this study, a "cutoff" 
value was obtained by carrying out an analysis on two randomly equi- 
valent groups (the two Native American samples). Since there is no 
bias present, the largest area statistic obtained serves as an 
indicator of the greatest value of the statistic likely to occur by 
chance (for a further discussion, see Rogers & Hambleton, in press). 
This approach is not ideal; however, it does provide an approximate 
answer to the cut-off score determination problem. 
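
The area calculation described above can be sketched in a few lines.
This is only an illustration under stated assumptions, not the program
used in the study: the function names are hypothetical, the scaling
constant D = 1.7 is assumed, each ICC is given as an (a, b, c) triple,
and the integral is approximated with a simple midpoint rule over an
interval supplied by the user.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC (D = 1.7 is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def icc_area(ref, foc, lo, hi, steps=2000):
    """Unsigned area between two ICCs on [lo, hi], by the midpoint rule.

    ref and foc are (a, b, c) triples for the two groups (hypothetical layout).
    """
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        total += abs(icc_3pl(theta, *ref) - icc_3pl(theta, *foc)) * width
    return total

# Example: two items with equal discrimination whose difficulties differ by 1.
area = icc_area((1.0, 0.0, 0.0), (1.0, 1.0, 0.0), -10.0, 10.0)
```

An item would then be flagged when its area exceeds the cutoff obtained
from the comparison of the two randomly equivalent groups.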

The Mantel-Haenszel Statistic. The Mantel-Haenszel method works 
directly with the item responses for the two groups (referred to in 
the psychometric literature as the reference group and the focal 
group). As described earlier, examinees are first sorted into score 
groups according to total test score, resulting in up to (n + 1) 
score groups. Within the jth score group, a 2x2 table of frequencies
is set up:

                      Item Score
    Group            1        0       Total

    Reference       A_j      B_j      n_Rj
    Focal           C_j      D_j      n_Fj

    Total           m_1j     m_0j     T_j

A_j, B_j, C_j, and D_j correspond to the numbers of examinees in the
four cells of the 2x2 table; n_Rj, n_Fj, m_1j, and m_0j are the
marginals. T_j is the number of examinees in the jth score group who
attempted the item under investigation. The Mantel-Haenszel
chi-square test statistic has the form:

\[
\chi^2_{MH} = \frac{\left( \left| \sum_j A_j - \sum_j E(A_j) \right| - \tfrac{1}{2} \right)^2}{\sum_j \mathrm{Var}(A_j)}
\]

where

\[
E(A_j) = \frac{n_{Rj}\, m_{1j}}{T_j}
\]

and

\[
\mathrm{Var}(A_j) = \frac{n_{Rj}\, n_{Fj}\, m_{1j}\, m_{0j}}{T_j^2\,(T_j - 1)}
\]

(From Holland & Thayer, 1988).

The Mantel-Haenszel statistic tends to be large, an indicator of item
bias, when item performance in the reference and focal groups over
the (n + 1) score groups is consistently different. For example, if
the reference group outperforms the focal group by 10% on the average
across the (n + 1) score groups, the MH chi-square test statistic
will be large, and the correct interpretation is that the item is
biased against the focal group.
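
The statistic can be computed directly from the 2x2 tables. The sketch
below is a minimal illustration rather than the program used in the
study; the function name and the input layout (one tuple of the four
cell counts per score group, in the order reference correct, reference
incorrect, focal correct, focal incorrect) are assumptions made for the
example.

```python
def mh_chi_square(tables):
    """Mantel-Haenszel chi-square from a list of (A, B, C, D) cell counts.

    One tuple per score group: A, B are the reference group's correct and
    incorrect counts; C, D are the focal group's.
    """
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t < 2:
            continue  # the variance term is undefined for fewer than 2 examinees
        n_r, n_f = a + b, c + d      # row marginals
        m1, m0 = a + c, b + d        # column marginals
        sum_a += a
        sum_e += n_r * m1 / t
        sum_v += n_r * n_f * m1 * m0 / (t * t * (t - 1))
    if sum_v == 0:
        return 0.0  # no score group carries information about the item
    return (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v

# Three score groups in which the reference group does consistently better.
stat = mh_chi_square([(30, 10, 20, 20)] * 3)
```

The result is referred to the chi-square distribution with one degree
of freedom.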






Computer Programs 

LOGIST. To estimate item characteristic curves, the LOGIST
program (Wood & Lord, 1976) was used. LOGIST estimates parameters
using the method of maximum likelihood. A modified Newton's method
is used to solve the likelihood equations. Estimation is conducted
in stages, in which first the item parameters are held fixed while
ability parameters are estimated, then the ability estimates obtained
are held fixed while item parameters are estimated.

In this study, three-parameter item characteristic curves were
fitted to all items for each sample.

IRTBIAS. This FORTRAN V program, written by the second author,
calculates the area between item characteristic curves as described
earlier. Output from the LOGIST program for the two groups of
interest may be input directly into IRTBIAS. The two sets of
b-parameter estimates are first placed on a common metric by scaling
both to a mean of zero and standard deviation of one. The other item
parameter estimates and ability estimates are then transformed
accordingly (Hambleton & Swaminathan, 1985). Item characteristic
curves are calculated and the area between them computed over an
ability interval specified by the user.
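
The scaling step amounts to a linear transformation of the ability
scale. The following sketch illustrates it for one group; it is not
the IRTBIAS code. The function name is hypothetical, and the use of the
population standard deviation (rather than the sample form) is an
assumption. Difficulties and abilities are shifted and divided by the
standard deviation of the b estimates, discriminations are multiplied
by it, and the c parameters are unchanged, so that every a(theta - b)
product, and hence every ICC, is preserved.

```python
import statistics

def rescale_to_b_metric(a, b, c, theta):
    """Rescale one group's estimates so the b values have mean 0 and SD 1.

    a, b, c are lists of item parameter estimates; theta is a list of
    ability estimates. Population SD is assumed here.
    """
    mean_b = statistics.mean(b)
    sd_b = statistics.pstdev(b)
    new_b = [(x - mean_b) / sd_b for x in b]
    new_a = [x * sd_b for x in a]              # keeps a * (theta - b) fixed
    new_theta = [(t - mean_b) / sd_b for t in theta]
    return new_a, new_b, list(c), new_theta    # c is scale-free

a = [1.0, 0.8, 1.2, 1.5]
b = [-1.0, 0.0, 1.0, 2.0]
c = [0.2, 0.2, 0.2, 0.2]
theta = [0.5, -0.3]
new_a, new_b, new_c, new_theta = rescale_to_b_metric(a, b, c, theta)
```

Applying the same function to each group's estimates separately places
both sets of ICCs on the common metric before the area is computed.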

Output from the program includes the value of the total area
between the ICCs for each item as well as the values of the
"positive" and "negative" areas, i.e., the area for which the ICC for
the reference group is higher than that of the focal group, and vice
versa.
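
Splitting the area by sign is a small change to the area calculation
and is useful for spotting crossing ICCs, where both parts are
sizeable. The sketch below is illustrative only; the names are
hypothetical, D = 1.7 is an assumed scaling constant, and a midpoint
rule stands in for whatever numerical integration the program used.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC (D = 1.7 is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def signed_areas(ref, foc, lo, hi, steps=2000):
    """Return (positive, negative, total) area between two ICCs on [lo, hi].

    Positive area: reference ICC above focal ICC. Negative area: the reverse.
    ref and foc are (a, b, c) triples.
    """
    width = (hi - lo) / steps
    pos = neg = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        diff = icc_3pl(theta, *ref) - icc_3pl(theta, *foc)
        if diff > 0:
            pos += diff * width
        else:
            neg += -diff * width
    return pos, neg, pos + neg

# Two ICCs that cross at theta = 0: equal difficulty, unequal discrimination.
pos, neg, total = signed_areas((1.5, 0.0, 0.0), (0.5, 0.0, 0.0), -3.0, 3.0)
```

For uniformly biased items one of the two parts is near zero; for the
crossing curves in the example the two parts are equal.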

MH STATISTIC. This program, also written by the second author
in FORTRAN V, calculates the Mantel-Haenszel item bias statistic. By
default, (n+1) score groups are constructed, where n is the number of
items.

Mantel-Haenszel statistics are computed in two steps, as 
recommended by Holland (1985). First, score groups are constructed 
using total scores based on all items. Mantel-Haenszel statistics 
are then calculated for all items. Those items with Mantel-Haenszel 
values exceeding the tabulated chi-square value at the .01 level of 
significance are identified. Next, total scores are recalculated 
excluding these items. With this "purified" criterion for ability, 
score groups are reformed and the Mantel-Haenszel statistics are 
computed once more. 
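
The two-step procedure can be sketched as follows. This is a minimal
illustration, not the MH STATISTIC program: the function names, the
"R"/"F" group labels, and the data layout (one list of 0/1 item scores
per examinee) are assumptions, and 6.635 is the tabulated chi-square
value at the .01 level with one degree of freedom. Following the
description above, flagged items are simply dropped from the total
score in the second step.

```python
from collections import defaultdict

def mh_chi_square(tables):
    """Mantel-Haenszel chi-square from (A, B, C, D) counts per score group."""
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t < 2:
            continue
        sum_a += a
        sum_e += (a + b) * (a + c) / t
        sum_v += (a + b) * (c + d) * (a + c) * (b + d) / (t * t * (t - 1))
    return 0.0 if sum_v == 0 else (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v

def item_tables(responses, groups, item, criterion_items):
    """Build one 2x2 table per total score on the criterion items."""
    cells = defaultdict(lambda: [0, 0, 0, 0])  # A, B, C, D
    for resp, grp in zip(responses, groups):
        score = sum(resp[i] for i in criterion_items)
        row = 0 if grp == "R" else 2           # reference rows first
        col = 0 if resp[item] == 1 else 1      # correct, then incorrect
        cells[score][row + col] += 1
    return [tuple(v) for v in cells.values()]

def two_step_mh(responses, groups, crit=6.635):
    """Return (step-1 statistics, flagged items, step-2 statistics)."""
    items = list(range(len(responses[0])))
    step1 = [mh_chi_square(item_tables(responses, groups, i, items))
             for i in items]
    flagged = {i for i, x in enumerate(step1) if x > crit}
    purified = [i for i in items if i not in flagged]
    step2 = [mh_chi_square(item_tables(responses, groups, i, purified))
             for i in items]
    return step1, flagged, step2
```

Some formulations of the procedure keep the studied item in its own
matching score even when it has been flagged; that refinement is not
shown here.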

Output from the program includes frequency distributions for the
two sub-groups, and results of the analysis for the first and second
steps. For each item, p-values for each sub-group, the common
odds-ratio, the Mantel-Haenszel chi-square statistic, and the
corresponding z-value are printed.

Procedure 

For the purposes of the study, four spaced samples of 1000
examinees each were drawn: two samples were Anglo-American and two
were Native American. To facilitate the analysis and reduce computer
time, only 75 of the 150 items were used. Items were chosen such
that very easy items (p > .90) and items with very low discrimination
(r < .1) were excluded; such items cause difficulties in IRT
parameter estimation and often lead to unusually unstable item bias
statistics. Three-parameter IRT models were fitted to each of the
four samples separately.
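
The screening rule can be illustrated with a short sketch. The
function names and data layout are hypothetical, and the
discrimination index is assumed here to be the correlation between the
item score and the total test score; the exact index used in the study
is not stated.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def screen_items(responses, max_p=0.90, min_r=0.10):
    """Return indices of items kept after dropping very easy (p > max_p)
    and poorly discriminating (r < min_r) items.

    responses: one list of 0/1 item scores per examinee.
    """
    totals = [sum(r) for r in responses]
    keep = []
    for i in range(len(responses[0])):
        scores = [r[i] for r in responses]
        p = sum(scores) / len(scores)           # item p-value
        if p > max_p:
            continue
        if pearson(scores, totals) < min_r:     # item-total correlation (assumed index)
            continue
        keep.append(i)
    return keep

# Ten examinees, four items: item 0 is answered correctly by everyone and
# item 3 correlates negatively with the total score.
data = ([[1, 1, 1, 0]] * 4 + [[1, 1, 0, 0]] + [[1, 0, 1, 1]]
        + [[1, 0, 0, 1]] * 4)
kept = screen_items(data)
```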

With two Anglo-American and two Native American samples, two 
independent bias analyses could be carried out. The second compar- 
ison was conducted to enable examination of the consistency with 
which each bias statistic flagged items across samples. The Mantel- 
Haenszel and ICC Area method statistics were calculated for each item 
in each of the two comparisons. 

To study the effect of the discrepancy in the score
distributions of the two groups on the Area method statistics, a
variation of the statistic was also computed. The ability interval
over which the area is calculated was modified to cover the ability
scale from two standard deviations below the Native American group
mean to two standard deviations above the same mean. By restricting
the interval in this way, attention was focused on that part of the
ability scale where most of the Native American examinees were
located, and hence where differences between the Anglo- and Native
American ICCs were of greatest practical significance.

To study the effect of the score distribution differences on the
Mantel-Haenszel statistic, a matched group analysis was carried out.
For this analysis, a third sample of Native Americans was selected
such that the distribution of scores more closely matched that of the
Anglo-American sample. A sample of 650 Native Americans was obtained
and compared with a sample of 650 Anglo-Americans. Both the
Mantel-Haenszel and Area method statistics were calculated.

Results 

As can be seen from Figures 1 and 2, the score distributions for
Anglo- and Native Americans were considerably different. The mean
and standard deviation for the first Anglo-American sample were 54.23
and 10.65, respectively, and for the first Native American sample, the
values were 36.55 and 11.19, respectively. The mean and standard
deviation for the second Anglo-American sample were 54.21 and 10.61,
respectively, and for the second Native American sample, the values
were 37.01 and 11.86, respectively.

Insert Figures 1 and 2 and Table 1 about here. 

After obtaining three-parameter model estimates for examinees
and items for the four samples (two Anglo-American, two Native
American), absolute-valued standardized residuals were calculated to
determine the appropriateness of the fits between the model and the
test data (Hambleton & Swaminathan, 1985). Table 1 provides a
summary of the results. The results show clearly that there was a
very close match between the three-parameter model and each set of
test data (see, for example, Hambleton & Rogers, in press). The
results were slightly better for the Anglo-American samples, but the
fits were excellent for all four datasets.






Insert Tables 2, 3, and 4 about here. 

Using the two independent Anglo- vs. Native American
comparisons, the consistency with which the Area method and the
Mantel-Haenszel method flagged items as potentially biased was
examined. Table 2 lists the items which were flagged by the Area
method in one or both comparisons, and indicates those items which
were consistently identified. The cut-off score for this method was
.468. Table 3 reports similar information for the Mantel-Haenszel
method. The cut-off score for this method was 6.60. Where an item was
flagged in one sample but the statistic was borderline in the other
sample, the result was treated as consistent. The consistency results
for the two methods are summarized in Table 4.

From Tables 2 and 3, it can be seen that both methods displayed
considerable instability across samples. Using the Area method, 20
of the 75 items were flagged in one comparison but not the other.
Using the Mantel-Haenszel method, 15 items were inconsistently
flagged. Overall consistency for the Area method was 73% and for the
Mantel-Haenszel, 80%. This moderate level of consistency is somewhat
surprising, considering that all of the results were based on 1000
examinees in each group, and disturbing in view of the fact that in
most situations, the practitioner would not have the luxury of a
cross-validation sample.
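
The consistency figures follow directly from the counts: 55 of 75
items (73%) received the same classification in both comparisons with
the Area method, and 60 of 75 (80%) with the Mantel-Haenszel method.
A minimal sketch of the calculation is given below; the function name
and the boolean flag lists are hypothetical, and the borderline rule
described above is not modeled.

```python
def classification_consistency(flags_1, flags_2):
    """Proportion of items given the same flag (biased / not biased)
    in two independent comparisons."""
    agree = sum(1 for f1, f2 in zip(flags_1, flags_2) if f1 == f2)
    return agree / len(flags_1)

# Illustrative flag patterns reproducing the reported counts:
# 20 of 75 items inconsistent (Area), 15 of 75 inconsistent (MH).
area_consistency = classification_consistency(
    [True] * 20 + [False] * 55, [False] * 75)
mh_consistency = classification_consistency(
    [True] * 15 + [False] * 60, [False] * 75)
```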

When the Area method and the Mantel-Haenszel method were
compared, the results were more encouraging. This comparison, however,
was carried out with items which were consistently identified as
biased across samples with the same method. Table 5 lists the items
and values of the bias statistics for the 16 items consistently
flagged by one or both methods. Of the 14 items consistently
identified by the Area method across the two comparisons and the nine
items consistently identified by the Mantel-Haenszel, seven items were
common. Thus, those items identified by the Mantel-Haenszel method
were more or less a subset of those identified by the Area method.

Insert Table 5 about here. 

Attention was then focused on the nine items consistently
flagged by one bias method but not the other. From Table 5, it can
be seen that two of the items consistently flagged by the Area method
(items 57 and 102) were flagged in one comparison by the
Mantel-Haenszel statistic. Hence, the discrepancy in results for these
items may have occurred due to a Type II error with the
Mantel-Haenszel statistic. Conversely, item 11, which was consistently
flagged by the Mantel-Haenszel statistic, was flagged in one
comparison by the Area method, suggesting a Type II error resulting
from the use of this method.

Insert Figures 3 to 8 about here. 

For the remaining six items, ICCs for the two groups were
plotted and are displayed in Figures 3 through 8. Figures 3 through
7 show the ICCs for the five items which were detected by the Area
method but not the Mantel-Haenszel method. For four of these items
(items 28, 30, 92, and 129), the ICCs crossed markedly. It is thus
not surprising that the Mantel-Haenszel statistic did not detect
these items, since it is not designed to detect non-uniform bias.
The discrepancy for item 88 is less easy to explain. This item is
potentially biased against the Anglo-American sample. Over the
interval (-3.7 to 3.5) the ICCs are clearly different and hence the
item was flagged by the Area method. The MH method did not detect
the item as biased because few Anglo-Americans scored in the region
of the scale where the largest differences were observed. The mean
ability score for the Anglo-Americans was about .90 (.91 in the first
sample and .89 in the second); the standard deviation of ability
scores was about .86 (.87 in the first sample and .85 in the second).
Clearly, only a small percent of Anglo-Americans (perhaps about 15%)
were in the region of the scale where differences were observed.

Figure 8 shows the ICCs for item 60, which was flagged by the
Mantel-Haenszel but not the Area method. The curves are uniformly
but not markedly different. Since the most pronounced differences
were in the region on the ability scale where many Native American
examinees scored, it was likely that the Mantel-Haenszel method would
detect item 60 as biased. The average ability score in the two
Native American samples was about -.60 and the standard deviation was
about 1.04. In contrast, the Area method addressed the differences
in the ICCs over a much wider interval on the ability scale (-3.7 to
3.5). Over the full scale, the differences were relatively modest
and therefore the item was not identified by the Area method.






When the Area statistic was computed over the restricted
interval (Native American mean score ± 2 standard deviations, which
was, approximately, -2.7 to 1.5), some changes in the ranking of
items were observed. However, all but two of the items (items 92 and
102) that were consistently identified in the original analysis were
consistently flagged over the narrowed and lower interval. Item 92
was an item where the ICCs crossed in the modified range, and did not
diverge widely within the interval. Other items for which the ICCs
crossed and which had large area values in the original analysis also
tended to be ranked lower in the modified analysis. This result
suggested that the modified area statistic might be more closely
related to the Mantel-Haenszel results than the original area
statistic for the dataset used in this study. When rank-order
correlations were calculated, this proved to be the case. In sample
1, the rank-order correlation between the original area statistic and
the Mantel-Haenszel statistic was .32, and between the modified area
statistic and the Mantel-Haenszel, .48.
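
A rank-order correlation between two sets of item bias statistics can
be computed as sketched below. The coefficient is assumed here to be
Spearman's (the text says only "rank-order"), tied values receive
average ranks, and the function names and example values are
hypothetical.

```python
def average_ranks(values):
    """Ranks from 1 to n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    vx = sum((p - mx) ** 2 for p in rx)
    vy = sum((q - my) ** 2 for q in ry)
    return cov / (vx * vy) ** 0.5

# Illustrative values only: area statistics and MH chi-squares for four items.
rho = spearman([0.10, 0.52, 0.31, 0.75], [1.2, 9.8, 4.1, 15.3])
```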

When the matched sample analysis was carried out, the
Mantel-Haenszel results changed very little. All items which were
previously consistently identified by the Mantel-Haenszel method were
flagged again, along with four others that had been flagged in the
sample 1 analysis. One item (item 28) was flagged which had not
previously been flagged in either comparison. The Area method results
showed greater change, as might be expected in view of the reduction
in sample size. Only five of the 14 items previously identified were
flagged in this analysis.

Discussion 

Several major points emerge from these results. First, both
IRT-based item bias methods and the Mantel-Haenszel method are
somewhat unreliable in identifying biased items. The Area method
results were consistent across samples (of 1000) about 73% of the
time; the Mantel-Haenszel results were consistent about 80% of the
time. This finding reinforces our preference for considering items
only "potentially" biased on the basis of the value of the bias
statistic. Also, this result helps to explain the moderate agreement
reported in the measurement literature among item bias methods
concerning items flagged as potentially biased. The fact is that
studies of convergence of item bias methods are influenced greatly by
the unreliability of item bias statistics.

Second, there is substantial agreement between an IRT-based item
bias method (the Area method) and the Mantel-Haenszel method in the
detection of uniformly biased test items. Also, the IRT-based item
bias method appears to detect non-uniformly biased items; the
Mantel-Haenszel method does not.

When the interval over which the area statistic was calculated
was changed, the rankings of items according to the value of the
statistic also changed. Restricting the ability interval to focus on
the region of the scale where most of the focal group is distributed
may lead to the identification of fewer non-uniformly biased items
with the Area method, and hence, greater congruence with the
Mantel-Haenszel results. Of course, in practice, the choice of
interval over which item bias is measured is an important
methodological consideration, and must be considered when
interpreting item bias statistics.

The distribution of test scores appears to have little impact on
the Mantel-Haenszel results. Matching the groups according to test
score distribution before calculating the bias statistics did not
substantially change the results. The Area method results were
influenced to a much greater extent, although this may have been due
in part to the reduction in sample size which was necessary to
achieve matching. In any case, the Mantel-Haenszel method showed
much greater stability in the face of reduced sample size than did
the Area method.

The implications of the results of this study for practice seem
clear. First, practitioners should be reminded about the
unreliability of item bias statistics. This means that they should
be encouraged to use large samples in their analyses whenever
possible and interpret item bias statistics with a fair degree of
caution. Second, the evidence suggests that the Mantel-Haenszel
method can be safely substituted for IRT-based methods if safeguards
are put in place to detect non-uniformly biased items. These items
are likely to go undetected by the Mantel-Haenszel method. One
safeguard would be to routinely compare the direction of the
difference in p-values for the two groups of interest across score
groups. If the direction of the difference favored one group at test
scores below a certain test score and favored the other group above
the test score, non-uniform bias could be suspected. Test items
showing this pattern of performance in the two groups, though not
identified by the Mantel-Haenszel method, could also be studied for
possible bias. Analyses like the one proposed can easily be
incorporated into computer programs to carry out the Mantel-Haenszel
method and provide some protection against non-uniformly biased items
going undetected. Other simple safeguards, such as graphing
techniques, could also be incorporated into the method to detect
non-uniformly biased test items.
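
One way the proposed safeguard might be programmed is sketched below.
It is an illustration under simplifying assumptions, not a tested
procedure: the function names and input layout are hypothetical, and
the rule used (exactly one change in the sign of the p-value difference
as test score increases) is a strict reading of the description above
that ignores sampling error in small score groups.

```python
def p_value_differences(tables_by_score):
    """Reference-minus-focal difference in proportion correct per score group.

    tables_by_score maps a test score to (A, B, C, D): reference correct and
    incorrect counts, then focal correct and incorrect counts.
    """
    diffs = []
    for score in sorted(tables_by_score):
        a, b, c, d = tables_by_score[score]
        if a + b == 0 or c + d == 0:
            continue  # one group is absent at this score
        diffs.append((score, a / (a + b) - c / (c + d)))
    return diffs

def crossing_score(diffs):
    """Return the score at which the direction of the difference reverses,
    if it reverses exactly once; otherwise return None."""
    signs = [(s, 1 if d > 0 else -1) for s, d in diffs if d != 0]
    flips = [signs[k][0] for k in range(1, len(signs))
             if signs[k][1] != signs[k - 1][1]]
    return flips[0] if len(flips) == 1 else None

# Focal group favored at low scores, reference group favored at high scores.
tables = {1: (2, 8, 5, 5), 2: (4, 6, 6, 4), 3: (8, 2, 5, 5), 4: (9, 1, 6, 4)}
cross = crossing_score(p_value_differences(tables))
```

In practice some tolerance for small or unstable differences would be
needed before an item showing this pattern is referred for review.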






References 



Berk, R. A. (Ed.). (1982). Handbook of methods for detecting test
     bias. Baltimore, MD: The Johns Hopkins University Press.

Hambleton, R. K., Martois, J. S., & Williams, C. (1983, April).
     Detection of biased test items with item response models. Paper
     presented at the annual meeting of AERA, Montreal.

Hambleton, R. K., Rogers, H. J., & Arrasmith, D. (1988).
     Identifying potentially biased test items: A comparison of the
     Mantel-Haenszel statistic and several item response theory
     methods. Laboratory of Psychometric and Evaluative Research
     Report No. 154. Amherst, MA: School of Education, University
     of Massachusetts.

Hambleton, R. K., & Rogers, H. J. (in press). Promising directions
     for assessing item response model fit to test data. Applied
     Psychological Measurement.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory:
     Principles and applications. Boston: Kluwer.

Holland, P. W. (1985). On the study of differential item difficulty.
     Princeton, NJ: Educational Testing Service.

Holland, P. W., & Thayer, D. T. (1986). Differential item
     functioning and the Mantel-Haenszel procedure. Technical Report
     No. 86-31. Princeton, NJ: Educational Testing Service.

Holland, P. W., & Thayer, D. T. (1988). Differential item
     performance and the Mantel-Haenszel procedure. In H. Wainer &
     H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ:
     Lawrence Erlbaum Associates.

Rogers, H. J., & Hambleton, R. K. (in press). Evaluating computer-
     simulated baseline statistics for interpreting item bias
     statistics. Educational and Psychological Measurement.

Rudner, L. M., Getson, P. R., & Knight, D. L. (1980). A Monte Carlo
     comparison of seven biased item detection techniques. Journal
     of Educational Measurement, 17, 1-10.

Shepard, L. A., Camilli, G., & Averill, M. (1981). Comparison of
     procedures for detecting test-item bias with both internal and
     external ability criteria. Journal of Educational Statistics,
     6, 317-375.

Shepard, L. A., Camilli, G., & Williams, D. M. (1984). Accounting
     for statistical artifacts in item bias research. Journal of
     Educational Statistics, 9, 93-138.

Wood, R. L., & Lord, F. M. (1976). A user's guide to LOGIST.
     Research Memorandum. Princeton, NJ: Educational Testing
     Service.



Figure 1. Test Score Distributions for the first Anglo-American (AA)
and Native American (NA) sample.

[Plot not reproduced. Horizontal axis: Test Score, 15 to 75.]



Figure 2. Test Score Distributions for the second Anglo-American (AA) 
and Native American (NA) sample. 




Figure 3. Anglo- and Native American ICCs for item 28.

[Plot not reproduced. Curves: Anglo-American, Native American.]

Figure 4. Anglo- and Native American ICCs for item 30.

[Plot not reproduced. Horizontal axis: theta, -3.5 to 3.5.]



Figure 5. Anglo- and Native American ICCs for item 88.

[Plot not reproduced. Curves: Anglo-American, Native American.
Horizontal axis: theta, -3.5 to 3.5.]

Figure 6. Anglo- and Native American ICCs for item 92.

[Plot not reproduced. Curves: Anglo-American, Native American.
Horizontal axis: theta, -3.5 to 3.5.]



Figure 7. Anglo- and Native American ICCs for item 129.

Figure 8. Anglo- and Native American ICCs for item 60.

[Plot not reproduced. Curves: Anglo-American, Native American.
Horizontal axis: theta.]



Table 1

Summary of the IRT Standardized Residuals

                                   Standardized Residuals
Sample                        |0 to 1|   |1 to 2|     [?]       [?]

Anglo-American, Sample 1         72%        25%        3%       .74
Anglo-American, Sample 2         73%        22%        5%       .76
Native American, Sample 1        66%        29%        5%       .83
Native American, Sample 2        70%        24%        6%       .80

[?] marks a column heading that is illegible in the source copy.



Table 2

Item Statistics for Potentially Biased Test Items
Identified Using the IRT Area Method¹
(Anglo-Americans vs Native Americans, N = 1000)

                    Sample 1                                  Sample 2
Test        AA             NA            Bias         AA             NA            Bias
Item       b      a       b      a  Statistic        b      a       b      a  Statistic  Stability

  11   -0.43   0.77   -0.00   0.98     0.354     -0.85   0.54   -0.03   0.81     0.645
  14   -1.23   0.36   -2.09   0.29     0.382     -0.50   0.49   -2.71   0.19     1.036
  15   -1.32   0.42   -0.99   0.81     0.566     -0.81   0.68   -0.98   0.86     0.215
  23    0.71   0.93    0.73   0.34   0.87[?]      0.47   0.82    0.61   0.48     0.451
  28    0.41   0.47    0.81   0.16     0.903      0.45   0.42    0.64   0.20     0.657   X
  30     [?]   0.72  0.[??]   0.32     0.736      0.58   0.79    0.47   0.37     0.701   X
  31   -0.33   1.04   -0.63   0.56     0.520     -0.65   0.76   -0.56   0.75     0.069
  39   -0.04   0.62   -0.19   0.58     0.115      0.21   0.91   -0.09   0.39     0.783
  42   -0.34   0.54   -0.78   0.60     0.201     -0.51   0.51    0.02   0.83     0.536
  43   -0.53   0.53   -0.60   0.40     0.269     -0.78   0.58   -0.59   0.35     0.491
  50    1.81   0.55    1.49   0.94     0.488      1.50   0.85    1.56   0.90     0.225
  56   -1.92   0.34   -0.76   0.50     0.647      0.25   1.14    0.42   0.69     0.195
  57   -1.44   0.50   -0.74   0.83     0.584     -1.35   0.53   -0.69   0.86     0.562   X
  67    2.11   0.84    1.76   0.76     0.277      2.32   0.87    1.61   0.81     0.509
  69   -1.17   0.55   -1.19   0.42     0.245     -1.80   0.37 -0.[??]   0.55     0.521
  75    0.87   0.91    1.66   0.53     0.608      0.89   0.90    1.25   0.87     0.281
  78    1.88   1.45    2.59   0.59     0.687      1.81   1.19    1.75   1.02     0.215
  82   -0.50   0.58    0.17   0.61     0.509     -0.66   0.46    0.03   0.76     0.626   X
  88   -0.62   0.90   -1.27   0.62     0.516   -0.[?]7   0.87   -1.34   0.56     0.493   X
  92    0.86   1.00    1.37   0.44     0.686      0.82   1.18    1.38   0.37     0.916   X

¹Although a three-parameter model was fitted to the data, the c-parameters for all items
reported here were estimated to be .20.

X designates test items which were identified as consistently potentially biased.

[?] marks digits or values that are illegible in the source copy.



Table 2 (cont.)

Item Statistics for Potentially Biased Test Items
Identified Using the IRT Area Method¹
(Anglo-Americans vs Native Americans, N = 1000)

                    Sample 1                                  Sample 2
Test        AA             NA            Bias         AA             NA            Bias
Item       b      a       b      a  Statistic        b      a       b      a  Statistic  Stability

  93    1.44    [?]    1.22   1.33       [?]       [?] 1.0[?]    1.09   1.79       [?]
 101    0.34   0.40    1.56   0.45     0.838      0.56   0.43    0.87   0.83     0.602   X
 102    0.27   0.69   -0.48   0.48     0.584     -0.09   0.47   -0.06   0.28     0.488   X
 107   -1.56   0.36   -0.59   0.45     0.567     -1.49   0.38   -0.56   0.43     0.581   X
 110   -1.46   0.43   -0.74   0.50     0.465     -2.06   0.36   -0.86   0.41     0.694   X
 115    0.25   0.45   -0.29   0.28     0.534      0.46   0.39   -0.06   0.30     0.380
 118   -1.56   0.28   -2.26   0.30   0.[?]70     -2.02   0.28   -2.96   0.29   0.[??]9
 122   -1.38   0.34   -0.10   0.70     0.945     -1.19   0.34   -0.08   0.59     0.789   X
 123   -1.07 0.[??]   -0.73   0.70     0.489     -1.18   0.38   -0.73   0.50     0.335
 125    0.77   0.21    0.96   0.30     0.322      0.90   0.18    0.84   0.44     0.751
 127   -1.19   0.67   -1.15   0.44     0.381     -1.05   0.67   -1.13   0.38     0.523
 128   -0.59   0.64   -0.10   1.28     0.617     -0.73   0.56    0.05   1.16     0.732   X
 129    0.40   0.56    0.17   0.35     0.477      0.37   0.67    0.79   0.24     0.941   X
 130    1.67   0.32    0.55   0.61     0.747      1.77   0.36    0.71   0.50     0.577   X

¹Although a three-parameter model was fitted to the data, the c-parameters for all items
reported here were estimated to be .20.

X designates test items which were identified as consistently potentially biased.

[?] marks digits or values that are illegible in the source copy.
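
The Bias Statistic column in Table 2 is an area between the two groups' item characteristic curves. As a rough illustration only, the sketch below approximates the unsigned area between two three-parameter logistic ICCs with c fixed at .20, using the Sample 1 estimates for item 28. The scaling constant D = 1.7, the interval [-3, 3], the midpoint integration rule, and the function names are all assumptions made for this example, so the result should not be expected to reproduce the tabled value.

```python
import math

D = 1.7  # logistic scaling constant (assumed; a common convention)


def icc_3pl(theta, a, b, c=0.20):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def area_between_iccs(params_1, params_2, lo=-3.0, hi=3.0, steps=600):
    """Approximate the unsigned area between two ICCs on [lo, hi] (midpoint rule).

    Each params_* argument is an (a, b) pair; the interval is an assumption.
    """
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        total += abs(icc_3pl(theta, *params_1) - icc_3pl(theta, *params_2))
    return total * width


# Item 28, Sample 1 estimates from Table 2, written as (a, b) for each group.
group_1 = (0.47, 0.41)
group_2 = (0.16, 0.81)
print(round(area_between_iccs(group_1, group_2), 3))
```

Because the absolute difference is integrated, curves that cross still contribute positive area on both sides of the intersection; a signed version would let those regions cancel.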






Table 3

Item Statistics for Potentially Biased Test Items
Identified Using the Mantel-Haenszel Method¹
(Anglo-Americans vs Native Americans, N = 1000)

                    Sample 1                                  Sample 2
Test        AA             NA            Bias         AA             NA            Bias
Item       b      a       b      a  Statistic        b      a       b      a  Statistic  Stability

  11   -0.43   0.77   -0.00   0.98     17.49     -0.85   0.54   -0.03   0.81     20.56   X
  14   -1.23   0.36   -2.09   0.29       [?]     -0.50   0.49   -2.71   0.19    6.[??]
  27    0.31   0.64   -0.15 0.[??]      7.67      0.27   0.63   -0.09   0.72      1.99
[?]9   -0.22    [?]    0.20 1.[?]9   25.[?]3      0.01   1.15    0.19   1.19      1.65
 [?]     [?]    [?]     [?]    [?]       [?]      0.67 0.9[?]    0.95   0.96      1.60
  47   -0.37   1.19   -0.64   1.34      0.62     -0.20   1.56   -0.67   1.14      8.66
 [?]     [?]   0.81    0.20   0.78     12.07      0.02   0.94    0.15   1.07      2.01
 [?]     [?]    [?]     [?]    [?]     12.10   -0.[??]   0.16 -0.[??]   0.57      0.11
  57   -1.44   0.50   -0.74   0.83      7.34     -1.35   0.53   -0.69   0.86      4.94
  60     [?] 0.4[?] -0.[?]4   0.54      8.08     -1.06   0.56   -0.95   0.39      7.30   X
  64    0.23   1.06    0.22   0.74      0.02     -0.00   0.86    0.34   0.85      9.52
  67    2.11   0.84    1.76   0.76      3.78      2.32   0.90    1.61   0.81     12.83
  75    0.87   0.91    1.66   0.53      7.01      0.89   0.46    1.25   0.87      3.42
  82   -0.50   0.58    0.17   0.61     20.17     -0.67   0.46    0.03   0.76      8.56   X
 101    0.34   0.40    1.56   0.45     30.87      0.56   0.43    0.87   0.83      8.32   X
 102    0.27   0.69   -0.48   0.48      9.67     -0.09   0.47   -0.06   0.28      0.36
10[?]   0.82   0.85    0.24   1.02     14.94      0.67   0.84    0.39   0.80      5.99
 107   -1.56   0.36   -0.59   0.45     11.43     -1.49   0.38   -0.56   0.43     11.24   X
 110   -1.46   0.43   -0.74   0.50     13.03     -2.06   0.36   -0.86   0.41     17.37   X
 118   -1.56   0.28   -2.26   0.30      3.56     -2.02   0.28   -2.96   0.29     10.39

¹Although a three-parameter model was fitted to the data, the c-parameters for all items
reported here were estimated to be .20.

X designates test items which were identified as consistently potentially biased.

[?] marks digits or values that are illegible in the source copy.






Table 3 (cont.)

Item Statistics for Potentially Biased Test Items
Identified Using the Mantel-Haenszel Method¹
(Anglo-Americans vs Native Americans, N = 1000)

                    Sample 1                                  Sample 2
Test        AA             NA            Bias         AA             NA            Bias
Item       b      a       b      a  Statistic        b      a       b      a  Statistic  Stability

 122   -1.38   0.34   -0.10   0.70     21.11     -1.19   0.34   -0.06   0.59     14.00   X
 127   -1.19   0.67   -1.15   0.44      8.98     -1.05   0.67   -1.13   0.38      5.82
 128   -0.59   0.64   -0.10   1.28     19.50     -0.73   0.56    0.01   1.16     16.27   X
 130    1.67   0.32    0.55   0.61      7.49      1.78   0.36    0.71   0.50     12.59   X

¹Although a three-parameter model was fitted to the data, the c-parameters for all items
reported here were estimated to be .20.

X designates test items which were identified as consistently potentially biased.
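
For readers who want to see how a Mantel-Haenszel statistic of the kind tabulated in Table 3 can be obtained, the following is a minimal sketch of the standard Mantel-Haenszel chi-square with continuity correction, computed from one 2 x 2 table per matched score level. It assumes this is the form of the bias statistic reported; the function name and the counts in the usage example are hypothetical and are not taken from the study data.

```python
def mantel_haenszel_chi2(strata):
    """Mantel-Haenszel chi-square (1 df) with continuity correction.

    strata: iterable of (a, b, c, d) counts, one tuple per score level, where
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    """
    sum_a = sum_expected = sum_variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a level with fewer than two examinees carries no information
        n_ref, n_foc = a + b, c + d
        n_right, n_wrong = a + c, b + d
        sum_a += a
        sum_expected += n_ref * n_right / n
        sum_variance += n_ref * n_foc * n_right * n_wrong / (n * n * (n - 1))
    if sum_variance == 0.0:
        return 0.0
    corrected = max(0.0, abs(sum_a - sum_expected) - 0.5)
    return corrected ** 2 / sum_variance


# Hypothetical counts for a single score level, for illustration only.
example = [(18, 2, 8, 12)]
print(round(mantel_haenszel_chi2(example), 2))
```

In practice the list would hold one tuple for each total-score level on which the two groups are matched, and the result would be referred to a chi-square distribution with one degree of freedom.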






Table 4

Summary of Results Concerning Consistency of
Bias - Non-Bias Classifications of 75 Test Items
in Two Independent Anglo- vs Native American Comparisons

                                                  Method
Category                                 IRT Area    Mantel-Haenszel

Biased, Sample 1; Biased, Sample 2          14              9
Biased, Sample 1; Non-Biased, Sample 2       9             10
Non-Biased, Sample 1; Biased, Sample 2      11              5
Non-Biased, Sample 1; Non-Biased, Sample 2  41             51

Number of Consistently Classified Items     55             60
Percent of Consistently Classified Items    73%            80%
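
The consistency figures in Table 4 follow directly from the four category counts: an item is consistently classified when it falls in the same category (biased or non-biased) in both samples. The short sketch below recomputes the last two rows from the counts in the table; the dictionary keys and the function name are labels chosen for this example.

```python
# Category counts from Table 4; keys give the (Sample 1, Sample 2) classification,
# with B = biased and N = non-biased.
counts = {
    "IRT Area": {"BB": 14, "BN": 9, "NB": 11, "NN": 41},
    "Mantel-Haenszel": {"BB": 9, "BN": 10, "NB": 5, "NN": 51},
}


def consistency(category_counts):
    """Return (consistent items, total items, percent consistent)."""
    total = sum(category_counts.values())
    consistent = category_counts["BB"] + category_counts["NN"]
    return consistent, total, 100.0 * consistent / total


for method, category_counts in counts.items():
    consistent, total, percent = consistency(category_counts)
    print(f"{method}: {consistent}/{total} = {percent:.0f}%")
# IRT Area: 55/75 = 73%
# Mantel-Haenszel: 60/75 = 80%
```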






Table 5

Agreement Between Methods in the Identification
of Potentially Biased Test Items¹

Test      IRT Area Method       Mantel-Haenszel Method
Item       S-1        S-2          S-1        S-2        Agreement

  11     (0.354)²    0.645        17.49      20.56
  28      0.903      0.657        (0.38)     (0.00)
  30      0.736      0.701        (0.10)     (4.54)
  57      0.584      0.562         7.34      (4.94)
  60     (0.315)    (0.349)        8.08       7.30
  82      0.509      0.626        20.17       8.56          X
  88      0.516      0.493        (2.90)     (0.46)
  92      0.686      0.916        (0.01)     (0.11)
 101      0.838      0.602        30.87       8.32          X
 102      0.584      0.488         9.67      (0.36)
 107      0.567      0.581        11.43      11.24          X
 110      0.465      0.694        13.03      17.37          X
 122      0.945      0.789        21.11      14.00          X
 128      0.617      0.732        19.50      16.27          X
 129      0.477      0.941        (2.11)     (0.41)
 130      0.747      0.577         7.49      12.59          X

¹Test items listed in the table were consistently identified as
biased by one or both methods.

²Values reported in brackets were not significant.
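
The Agreement column of Table 5 can be checked by treating each method's consistently flagged items as a set and intersecting the two sets. The sketch below encodes only whether each tabled value is significant (not bracketed); the variable names are chosen for this example.

```python
# For each item: (IRT S-1, IRT S-2, MH S-1, MH S-2).
# True means the tabled value is significant, i.e. not reported in brackets.
table5 = {
    11: (False, True, True, True),
    28: (True, True, False, False),
    30: (True, True, False, False),
    57: (True, True, True, False),
    60: (False, False, True, True),
    82: (True, True, True, True),
    88: (True, True, False, False),
    92: (True, True, False, False),
    101: (True, True, True, True),
    102: (True, True, True, False),
    107: (True, True, True, True),
    110: (True, True, True, True),
    122: (True, True, True, True),
    128: (True, True, True, True),
    129: (True, True, False, False),
    130: (True, True, True, True),
}

# Items flagged in both samples by each method.
irt_consistent = {item for item, (i1, i2, _, _) in table5.items() if i1 and i2}
mh_consistent = {item for item, (_, _, m1, m2) in table5.items() if m1 and m2}
agreed = sorted(irt_consistent & mh_consistent)

print(len(irt_consistent), len(mh_consistent), agreed)
# 14 9 [82, 101, 107, 110, 122, 128, 130]
```

The set sizes, 14 and 9, match the first row of Table 4, and the seven items in the intersection are the ones marked X under Agreement.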



